Unit 3: Background Removal with Robust PCA

Fast.ai Computational Linear Algebra Lecture 4

WNixalo - 2018/3/7


The Goal:

Getting Started

Using the real video 003 dataset from BMC 2012 Background Models Challenge Dataset

Other sources of datasets:

Imports:


In [4]:
import moviepy.editor as mpe
# from IPython.display import display
from glob import glob

import sys, os
import numpy as np
import scipy

import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
# MAX_ITERS = 10
TOL = 1e-8

In [8]:
video = mpe.VideoFileClip("data/Video_003/Video_003.avi")

In [11]:
video.subclip(0,50).ipython_display(width=300)


Out[11]:

In [12]:
video.duration  # seconds


Out[12]:
113.57

Helper Methods


In [63]:
def create_data_matrix_from_video(clip, k=5, scale=50):
    # Sample k frames per second, grayscale each frame, resize to `scale` percent,
    # flatten each frame into a column, and stack the columns (pixels x frames).
    # NB: scipy.misc.imresize was removed in newer SciPy versions (course-era API).
    return np.vstack([scipy.misc.imresize(rgb2gray(clip.get_frame(i/float(k))).astype(int),
                      scale).flatten() for i in range(k * int(clip.duration))]).T

def rgb2gray(rgb):
    # standard luma weights (ITU-R BT.601)
    return np.dot(rgb[...,:3], [0.299, 0.587, 0.114])

def plt_images(M, A, E, index_array, dims, filename=None):
    f = plt.figure(figsize=(15, 10))
    r = len(index_array)
    pics = r*3
    for k, i in enumerate(index_array):
        for j, mat in enumerate([M, A, E]):
            sp = f.add_subplot(r, 3, 3*k + j + 1)
            sp.axis('Off')
            pixels = mat[:,i]
            if isinstance(pixels, scipy.sparse.csr_matrix):
                pixels = pixels.todense()
            plt.imshow(np.reshape(pixels, dims), cmap='gray')
    return f

def plots(ims, dims, figsize=(15,20), rows=1, interp=False, titles=None):
    if type(ims[0]) is np.ndarray:
        ims = np.array(ims)
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        plt.imshow(np.reshape(ims[i], dims), cmap="gray")

Load and View Data

We're scaling each image (one moment in time) down to 60 x 80 pixels. We can unroll that picture into a single tall column: instead of a 2D 60x80 picture, we have a 4800 x 1 column.

This isn't very human-readable, but it's useful because it lets us line up images from different times side by side as columns, putting them all into one matrix. If we take a frame every 1/100th of a second for ~113 s (11,300 frames through time), we get a 4800 x 11300 matrix representing the video.
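
A quick sanity check on the "unrolling" idea (this cell is mine, not from the course notebook): flattening a frame into a column is lossless, so we can always reshape it back into the 2D image.

In [ ]:
import numpy as np

frame = np.arange(60 * 80).reshape(60, 80)          # stand-in for one 60x80 grayscale frame
col   = frame.flatten()                             # shape (4800,) -- one column of M
assert np.array_equal(col.reshape(60, 80), frame)   # the round-trip is exact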


In [17]:
scale = 25  # Adjust scale to change resolution of image. eg 25% scale here.
dims  = (int(240*(scale/100)), int(320*(scale/100))) # orig res: 320x240 (x,y)

In [18]:
M = create_data_matrix_from_video(video, 100, scale) # so k is ~fps
# M = np.load("lores_surveillance_matrix.npy")

In [19]:
print(dims, M.shape)


(60, 80) (4800, 11300)

In [21]:
# display 141st scaled frame
plt.imshow(np.reshape(M[:,140], dims), cmap='gray');


Note: the course notebook's image is 'hi-res' because R. Thomas did this first at full resolution (although that would be very slow here).

Since create_data_matrix_from_video is somewhat slow, we'll save our matrix. In general, whenever you have slow pre-processing steps, it's a good idea to save the results for future use.


In [24]:
np.save("lores_surveillance_matrix.npy", M)

NOTE: only run the below 2 lines on the lo-res (25%) version


In [25]:
plt.figure(figsize=(12,12))
plt.imshow(M, cmap='gray')


Out[25]:
<matplotlib.image.AxesImage at 0x180dbf6a58>

That's what the whole video looks like. I'm gonna throw a wild guess that those lines (non-horizontals) are correlated with the movements in the video.

Oooo I was right.

Questions: What are those wavy black lines? What are the horizontal lines?

$\longrightarrow$ The wavy lines are the people moving in the video. The horizontal lines are the background.

Each column is a frame / moment in time. So something like a stop sign will just stay static if the camera doesn't move.

The horizontal lines tell us that this matrix is low-rank. Rank is the number of linearly independent columns $\rightarrow$ and that turns out to be equal to the number of linearly independent rows (column rank = row rank, always).

$\longrightarrow$ put another way: rank is the dimension of the column space or row space (which equals the number of linearly independent columns or rows)
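
A quick illustrative check (my own cell): a matrix whose columns are all copies of the same frame has rank 1 -- exactly the "static background shows up as horizontal bands" situation.

In [ ]:
import numpy as np

frame  = np.random.rand(4800)               # one unrolled frame
static = np.tile(frame[:, None], (1, 10))   # 10 identical "frames" stacked as columns
print(np.linalg.matrix_rank(static))        # -> 1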

SVD

Review from Lesson 2: (Unit 2)

  • What kind of matrices are returned by SVD?
  • What is a way to speed up Truncated SVD?

A first attempt with SVD


In [26]:
from sklearn import decomposition

In [27]:
u, s, v = decomposition.randomized_svd(M, 2) 
# 2nd arg is num components, ie singular values, you want back. Trial&Error.

In [28]:
low_rank = u @ np.diag(s) @ v

USV is low-rank because even though U has a ton of nonzero values, S is just 2 singular values (the components we told truncated SVD we wanted). Likewise V (or V$^T$) is 2 x a-ton-of-values. So when you multiply USV, there will only be 2 linearly independent columns.
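
To convince myself (an illustrative cell, reusing the decomposition import from above; a small random matrix so matrix_rank is cheap): any product of (m, 2), (2,), and (2, n) factors has rank at most 2.

In [ ]:
A = np.random.rand(100, 200)
u2, s2, v2 = decomposition.randomized_svd(A, 2)        # truncate to 2 components
print(np.linalg.matrix_rank(u2 @ np.diag(s2) @ v2))    # -> 2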

Linear Algebra is trippy and awesome. No wonder it makes people freak out about AIpocalypses the moment you machine-power it.


In [33]:
u.shape, s.shape, v.shape


Out[33]:
((4800, 2), (2,), (2, 11300))

NB: the Σ/σ vs. S/s notation gets mixed up across sources. R. Thomas uses S/s for Σ/σ. The FB research post defines U=QS with Σ as the singular-value matrix, while U. Lisboa calls Σ itself S. But R. Thomas also uses s for Σ's values. Anyway, I've got it now.


In [30]:
low_rank.shape


Out[30]:
(4800, 11300)

In [31]:
plt.imshow(np.reshape(low_rank[:,140], dims), cmap='gray')


Out[31]:
<matplotlib.image.AxesImage at 0x181ab6e780>

We can take a look at our low-rank reconstructed matrix:


In [35]:
plt.figure(figsize=(12,12))
plt.imshow(low_rank, cmap='gray');


You can think of this as a data-compression problem. With just rank-2 (2 cols in U, 2 σs, 2 rows in V$^T$), we get this much of our original matrix back.

Now what if we just want to get the people back out of the video, instead of the background? That's as simple as subtracting:


In [39]:
plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims), cmap='gray');


Zooming in (this is better with the hi-res version)


In [42]:
plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims)[15:40,25:65], cmap='gray');


Okay. That's really cool that we can do so much with such a simple algorithm. Now how much better could we do with a more sophisticated algorithm?

Principal Component Analysis (PCA)

When dealing with high-dimensional datasets, we often leverage the fact that the data has low intrinsic dimensionality in order to alleviate the 'curse of dimensionality and scale' (maybe the data lies in a low-dimensional subspace or on a low-dimensional manifold). Principal Component Analysis is useful for eliminating dimensions. Classical PCA seeks the best rank-k estimate L of M (minimizing $\lvert\lvert M - L\lvert\lvert$ where L has rank k).

Traditional PCA can handle small noise, but is brittle with respect to grossly corrupted observations -- even one grossly corrupted observation can significantly mess up the answer.

Robust PCA factors a matrix into the sum of two matrices, M = L + S, where M is the original matrix, L is low-rank, and S is sparse. This is what we'll be using for background removal.

Low-rank means the matrix has lots of redundant (linearly dependent) information -- in this case it's the background, which is identical in every frame.

Sparse means the matrix has mostly zero entries -- in this case, see how the foreground picture (of people) is mostly empty. In the case of corrupted data, S captures the corrupted entries.

When working with a high-dimensional dataset, you want to use the fact that the data usually has a low intrinsic dimensionality -- i.e. not all of its dimensions carry information.
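
A toy version of the M = L + S model (my own illustration, not from the course notebook): a rank-1 "background" plus a sparse "foreground".

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
background = np.outer(np.ones(6), rng.rand(8))         # rank 1: the same row repeated (a static scene)
foreground = np.zeros((6, 8)); foreground[2, 3] = 5.0  # sparse: a single "moving object" entry
M_toy = background + foreground                        # what the camera actually records
print(np.linalg.matrix_rank(background), (foreground != 0).sum())  # -> 1 1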

Applications of Robust PCA

  • Video Surveillance
  • Face Recognition: photos from this tutorial. The dataset here consists of images of several people's faces at the same angle but under different lighting.

example image

example image 2

  • Latent Semantic Indexing: $L$ captures common words used in all documents while $S$ captures the few key words that best distinguish each document from the others

  • Ranking and Collaborative Filtering: a small portion of the available rankings could be noisy or even tampered with (see Netflix RAD - Outlier Detection on Big Data)

L1 Norm induces sparsity

The unit ball $\lvert\lvert x \lvert\lvert{}_{1}= 1$ is a diamond in the L1 norm. Its extrema are the corners:

A similar perspective is to look at the contours of the loss function:

basically:

L1 norm wants to keep nonzero terms to a minimum, even if some weights get larger. L2 norm wants to minimize Sum-of-Squares, so it wants everything as small as possible $\Longrightarrow$ more nonzero terms.

L1: $\lvert\lvert x \lvert\lvert{}_{1} = \sum_i \lvert x_i \lvert$ ;  L2: $\lvert\lvert x \lvert\lvert{}_{2} = \big(\sum_i x_i^2\big)^{1/2}$
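
A quick numeric way to see the sparsity-inducing effect (my own cell, using the standard proximal / soft-thresholding operators rather than solving a full optimization): penalizing the L1 norm zeroes out small entries exactly, while penalizing the squared L2 norm only shrinks everything a bit.

In [ ]:
import numpy as np

x   = np.array([0.3, -0.05, 2.0, 0.0, -0.4])
tau = 0.1
prox_l1 = np.sign(x) * np.maximum(np.abs(x) - tau, 0)  # prox of tau*||x||_1: soft-threshold
prox_l2 = x / (1 + tau)                                # prox of (tau/2)*||x||_2^2: uniform shrink
print(prox_l1)   # small entries become exactly 0 -> sparse
print(prox_l2)   # nothing becomes exactly 0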

Optimization Problem

Robust PCA can be written:

$$\text{minimize} \ \lvert\lvert L \lvert\lvert{}_* + λ \lvert\lvert S \lvert\lvert{}_1$$

$$\text{subject to:} \ L + S = M$$

where:

  • ||•||$_1$ is the L1 norm. Minimizing the L1 norm results in sparse values. Here it is the entrywise L1 norm of $S$ (the sum of the absolute values of all its entries), not the induced matrix 1-norm (which would be the maximum absolute column sum).
  • ||•||$_*$ is the nuclear norm, which is the L1 norm of the singular values. Minimizing this results in sparse singular values $\longrightarrow$ low rank.

English: the way we describe a sparse matrix is by minimizing the L1 norm; a low-rank matrix by minimizing the nuclear norm (which is the L1 norm of the singular values).

$\longrightarrow$ ie: if a lot of singular values are zero, you have a low-rank matrix. This leads into Optimization and courses taught by Stephen Boyd (linked below)
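
To connect the two norms in the objective to code, a quick illustrative cell (np.linalg.norm with 'nuc' computes the nuclear norm):

In [ ]:
import numpy as np

A = np.random.rand(5, 4)
nuclear      = np.linalg.norm(A, 'nuc')   # sum of singular values (same as np.linalg.svd(A)[1].sum())
entrywise_l1 = np.abs(A).sum()            # the entrywise L1 norm used on S in the objective above
print(nuclear, entrywise_l1)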

Implementing an Algorithm from a paper

Source

We'll use the general Principal Component Pursuit algorithm from this Robust PCA paper, in the specific form of Matrix Completion via the Inexact ALM Method in §3.1 here -- Chen-Lin-Ma. Also referenced implementations here and here.

PCP is one way of doing Robust PCA, i.e. decomposing a matrix into the sum of a low-rank and a sparse matrix.

The Good Parts

§1: Intro, and §5: Algorithms are our main interests.

You don't need to know the math or understand the proofs to implement an algorithm from a paper. There are (as of 2017-18) many theoretical researchers, and little pragmatic advice.

Algorithm on p29:

Definitions:

$\mathcal{S}$ : shrinkage operator

$\mathcal{D}$ : singular value thresholding operator

From the paper:

$\longrightarrow$ so the Shrinkage Operator subtracts a value τ from the magnitude of each entry, clamped at zero (elementwise soft-thresholding); the thresholding operator below applies the same shrinkage to the singular values.

$\mathcal{S}_τ[x] = \text{sign}(x) \cdot \max(\lvert x \lvert - τ,\ 0)$

The Singular Value Thresholding Operator takes the SVD, makes the singular values smaller, then recomposes the matrix. $\longrightarrow$ decreases the magnitude a bit, but most importantly sets some σ's to zero $\rightarrow$ makes them more sparse.

$\longrightarrow$ similar to doing truncated SVD. Every σ < τ $\rightarrow$ 0.

which makes sense, because we just learned that sparse singular values make a low-rank matrix
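
Here's a minimal sketch of the singular value thresholding operator $\mathcal{D}_τ$ as described above (my own cell; the course code below folds this into svd_reconstruct): take the SVD, soft-threshold the σ's, and recompose.

In [ ]:
import numpy as np

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0)      # every sigma < tau goes to exactly 0
    return U @ np.diag(s_shrunk) @ Vt      # lower-rank reconstruction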

Back to the Algorithm:

In 3.: the low-rank matrix is created from singular-value thresholding.

In 4.: the sparse matrix is created by looking at the remaining error, the original matrix $-$ the low-rank matrix (because you want your low-rank & sparse matrices to sum to the original), and the $μ^{-1}Y_k$ term keeps track of what's still left to approximate.

THE BASIC IDEA of all this is to iteratively go back and forth between improving the low-rank matrix estimate and the sparse matrix estimate. And singular value thresholding is very similar to truncating a bunch of singular values.

§3.1 of this paper - Chen-Lin-Ma has a faster variation of this Algorithm:

ALM: Augmented Lagrange Multipliers

§4 has some helpful implementation details on how many singular values to calculate (& how to choose parameter vals):

As I understand §4: if the previous truncated SVD already captured every singular value above the threshold, you just increment the predicted rank by 1; otherwise you bump it by ~5% of the dimension (that's the `round(0.05*n)` in the code below). And you only bother with a partial SVD while the predicted rank stays under roughly 20% of the size of your matrix; past that, a full SVD is just as fast.
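
Written out as a sketch (my own cell), the rank-prediction rule matches the `sv = svp + ...` line in the `pcp` code below, where `svp` is the count of singular values that survived the last threshold:

In [ ]:
def predict_sv(sv, svp, n):
    # if the last truncated SVD already covered everything above the threshold,
    # creep up by 1; otherwise jump by ~5% of the dimension n
    return svp + 1 if svp < sv else svp + round(0.05 * n)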

If you want to learn more theory:

Robust PCA (via Principal Component Pursuit)

Methods

We'll use Facebook's Fast Randomized PCA library.


In [44]:
from scipy import sparse
from sklearn.utils.extmath import randomized_svd
import fbpca

In [45]:
TOL = 1e-9
MAX_ITERS = 3

In [151]:
def converged(Z, d_norm):
    err = np.linalg.norm(Z, 'fro') / d_norm  # 'fro': Frobenius Norm
    print('error: ', err)
    return err < TOL

# the Shrinkage operator
def shrink(M, tau):
    S = np.abs(M) - tau
    return np.sign(M) * np.where(S > 0, S, 0)

# fast randomized SVD implementation from Facebook -- github.com/facebook/fbpca
def _svd(M, rank):
    return fbpca.pca(M, k=min(rank, np.min(M.shape)), raw=True)
    # NOTE: this is SVD, but Facebook calls it PCA..

def norm_op(M):
    return _svd(M, 1)[1][0]

def svd_reconstruct(M, rank, min_sv):
    u, s, v = _svd(M, rank)  # take SVD
    s -= min_sv              # subtract off from singular values
    nnz = (s > 0).sum()
    return u[:,:nnz] @ np.diag(s[:nnz]) @ v[:nnz], nnz  # reconstruct

def pcp(X, maxiter=10, k=10):  # refactored
    m, n = X.shape
    trans = m < n                      # transpose the matrix:
    if trans: X = X.T; m, n = X.shape  # you want it to be tall & skinny

    lamda = 1/np.sqrt(m)
    op_norm = norm_op(X)
    Y = np.copy(X) / max(op_norm, np.linalg.norm(X, np.inf) / lamda)
    μ = k*1.25/op_norm; μ_bar = μ * 1e7; rho = k * 1.5
    
    d_norm = np.linalg.norm(X, 'fro')
    L = np.zeros_like(X); sv = 1
    
    examples = []
    
    for i in range(maxiter):
        print("rank sv:", sv)
        X2 = X + Y/μ
              
        # update estimate of Sparse Matrix by "shrinking/truncating": original - low-rank
        S = shrink(X2 - L, lamda/μ)
              
        # update estimate of Low-Rank Matrix by doing truncated SVD of rank sv & reconstructing.
        # count of singular values > 1/μ is returned as svp
        L, svp = svd_reconstruct(X2 - S, sv, 1/μ)
        
        # If svp < sv, you're already calculating enough singular values.
        # If not, add ~5% of n (here round(0.05*4800) = 240) to sv
        sv = svp + (1 if svp < sv else round(0.05*n))
              
        # residual
        Z = X - L - S
        Y += μ*Z; μ *= rho
        
        examples.extend([S[140,:], L[140,:]])
              
        if μ > μ_bar: μ = μ_bar  # cap the penalty parameter at μ_bar
        if converged(Z, d_norm): break
    if trans: L=L.T; S=S.T
    return L, S, examples

SVD & PCA are different but closely related: PCA uses SVD. PCA typically subtracts the mean from the data first and involves multiplying the (centered) matrix by its transpose.

Careful: subtracting the mean from a sparse matrix makes it no longer sparse.
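
A sketch of what "PCA = SVD of the mean-centered data" looks like (my own cell, plain NumPy rather than sklearn):

In [ ]:
import numpy as np

X  = np.random.rand(100, 4)               # 100 samples, 4 features
Xc = X - X.mean(axis=0)                   # subtract the mean (note: this densifies a sparse X)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components         = Vt                   # principal directions
explained_variance = s**2 / (len(X) - 1)  # variance captured along each direction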

Results


In [147]:
L, S, examples = pcp(M, maxiter=5, k=10)


rank sv: 1
error:  0.13163793711394506
rank sv: 241
error:  0.045851568879608784
rank sv: 49
error:  0.005924598861072088
rank sv: 289
error:  0.0005656103252910875
rank sv: 529
error:  2.4546734769803146e-05

In [148]:
plots(examples, dims, rows=5)



In [149]:
f = plt_images(M, S, L, [0, 140, 1000], dims)


NOTE: I was having a huge debugging issue here with my code not reproducing the results from the course notebook. After a couple of hours I found out I was setting μ += rho instead of μ *= rho ....


In [150]:
# np.save("lores_L.npy", L)
# np.save("lores_S.npy", S)

Extracting a bit of the foreground is easier than identifying the background. To accurately get the background, you need to remove the entire foreground, not just parts of it.

LU Factorization

